Automatically Assessing Machine Summary Content Without a Gold Standard

Authors

  • Annie Louis
  • Ani Nenkova
Abstract

The most widely adopted approaches for evaluation of summary content follow some protocol for comparing a summary with gold-standard human summaries, which are traditionally called model summaries. This evaluation paradigm falls short when human summaries are not available and becomes less accurate when only a single model is available. We propose three novel evaluation techniques. Two of them are model-free and do not rely on a gold standard for the assessment. The third technique improves standard automatic evaluations by expanding the set of available model summaries with chosen system summaries. We show that quantifying the similarity between the source text and its summary with appropriately chosen measures produces summary scores that replicate human assessments accurately. We also explore ways of increasing evaluation quality when only one human model summary is available as a gold standard. We introduce pseudomodels, which are system summaries deemed to contain good content according to automatic evaluation. Combining the pseudomodels with the single human model to form the gold standard leads to higher correlations with human judgments compared to using only the one available model. Finally, we explore the feasibility of another measure: similarity between a system summary and the pool of all other system summaries for the same input. This method of comparison with the consensus of systems produces impressively accurate rankings of system summaries, achieving correlation with human rankings above 0.9.
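To make the model-free idea concrete, here is a minimal sketch, assuming Jensen-Shannon divergence between unigram word distributions as the similarity measure (one of the measures the paper considers); the function names and the pooling of other systems' summaries into a single consensus text are illustrative choices, not the paper's exact procedure.

```python
# Minimal sketch: score summaries without a gold standard by comparing
# word distributions. Jensen-Shannon (JS) divergence is one of the
# similarity measures the paper considers; names here are illustrative.
import math
from collections import Counter

def word_distribution(text):
    """Unigram probability distribution over lowercased whitespace tokens."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence between two unigram distributions."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl(a):
        # KL(a || m); terms with a(w) = 0 contribute nothing.
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

def model_free_score(source_text, summary):
    """Model-free score: summaries whose word distribution is closer
    to the input's score higher (divergence is negated)."""
    return -js_divergence(word_distribution(source_text),
                          word_distribution(summary))

def consensus_score(summary, other_system_summaries):
    """Consensus score: similarity of a summary to the pooled text of
    all other systems' summaries for the same input."""
    pool = " ".join(other_system_summaries)
    return -js_divergence(word_distribution(summary),
                          word_distribution(pool))
```

Ranking the system summaries for an input by such scores and correlating that ranking with human judgments is the evaluation setup under which the paper reports correlations above 0.9 for the consensus approach.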

Similar Resources

Generating and Validating Abstracts of Meeting Conversations: a User Study

In this paper we present a complete system for automatically generating natural language abstracts of meeting conversations. This system comprises components for interpretation of the meeting documents according to a meeting ontology, transformation (content selection) from that source representation to a summary representation, and generation of new summary text. In a formative ...


Entailment-based Fully Automatic Technique for Evaluation of Summaries

We propose a fully automatic technique for evaluating text summaries without the need to prepare gold-standard summaries manually. Standard and popular summary evaluation techniques and tools are not fully automatic; they all require some manual process or a manually prepared reference summary. Using recognizing textual entailment (TE), automatically generated summaries can be evaluated completely automat...
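As a rough illustration of the idea only (not necessarily the authors' exact method), the sketch below scores a summary by the fraction of its sentences judged entailed by the source; `entails` is a hypothetical placeholder for any trained textual-entailment classifier.

```python
# Hedged sketch of entailment-based summary evaluation. `entails` is a
# hypothetical stand-in for a trained RTE model, not a real library call.

def entails(premise: str, hypothesis: str) -> bool:
    """Hypothetical RTE classifier: True if premise entails hypothesis.
    In practice this would wrap a trained entailment model."""
    raise NotImplementedError

def entailment_score(source_sentences, summary_sentences):
    """Fraction of summary sentences entailed by at least one source sentence."""
    if not summary_sentences:
        return 0.0
    entailed = sum(
        1 for h in summary_sentences
        if any(entails(p, h) for p in source_sentences)
    )
    return entailed / len(summary_sentences)
```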


ResQu: A Framework for Automatic Evaluation of Knowledge-Driven Automatic Summarization

JAYKUMAR, NISHITA. M.S., Department of Computer Science and Engineering, Wright State University, 2016. ResQu: A Framework for Automatic Evaluation of Knowledge-Driven Automatic Summarization. Automatic generation of summaries that capture the salient aspects of a search result set (i.e., automatic summarization) has become an important task in biomedical research. Automatic summarization offers...


Assessing the practical usability of an automatically annotated corpus

The creation of a gold standard corpus (GSC) is a very laborious and costly process. Silver standard corpus (SSC) annotation is a very recent direction of corpus development which relies on multiple systems instead of human annotators. In this paper, we investigate the practical usability of an SSC when a machine learning system is trained on it and tested on an unseen benchmark GSC. The main f...


Research Paper: A Comparison of Citation Metrics to Machine Learning Filters for the Identification of High Quality MEDLINE Documents

OBJECTIVE The present study explores the discriminatory performance of existing and novel gold-standard-specific machine learning (GSS-ML) focused filter models (i.e., models built specifically for a retrieval task and a gold standard against which they are evaluated) and compares their performance to citation count and impact factors, and non-specific machine learning (NS-ML) models (i.e., mod...


Journal:
  • Computational Linguistics

Volume 39, Issue

Pages -

Publication date: 2013